Abstract

A number of quality measures are evaluated for gray scale image compression. They are all 
bivariate, exploiting the differences between corresponding pixels in the original and degraded 
images. It is shown that although some numerical measures correlate well with the observers' 
response for a given compression technique, they are not reliable for an evaluation across different 
techniques. The two graphical measures (histograms and Hosaka plots), however, can be used to 
appropriately specify not only the amount, but also the type of degradation in reconstructed 
images. 
1. Introduction

The need for storing and transmitting huge volumes of data in today's computer and 
communications systems necessitates data compression in many fields ranging from medicine to 
aerospace. Data compression is an encoding process to reduce the storage and transmission 
requirements in applications. Many efficient techniques with considerably different features have 
recently been developed for both lossless and lossy compression. The evaluation of lossless 
techniques is normally a simple and straightforward task, where a number of standard criteria 
(compression ratio, execution time, etc.) are employed. A major problem in evaluating lossy 
techniques is the extreme difficulty in describing the type and amount of degradation in 
reconstructed images. Because of the inherent drawbacks associated with the subjective measures 
of image quality, there has been a great deal of interest in developing a quantitative measure, either 
in numerical or graphical form, that can consistently be used as a substitute. We would like to 
have such a measure not only to judge the quality of images obtained by a particular algorithm, but 
also for quality judgment across various algorithms. The latter task is definitely more challenging 
since a wide range of image impairments is involved. An extensive survey and a classification of 
the quality measures that appeared in the relevant literature are given in [1]. 

It is known that the mean square error (MSE), the most common objective criterion, or its variants 
do not correlate well with subjective quality measures. A major emphasis in recent research has 
therefore been given to a deeper analysis of the human visual system (HVS). The HVS is too 
complex to fully understand with present psychophysical means, but the incorporation of even a 
simplified model into objective measures reportedly leads to a better correlation with the response 
of the human observers. 

We attempt to evaluate the usefulness of some of the objective quality measures listed in [1] 
through a set of experiments. 

2. Image Quality Measures, Compression Techniques, and Test Images 

The quality measures included in our evaluation are listed in Table 1. They are all discrete and
bivariate, i.e., they provide some measure of closeness between two digital images by exploiting
the differences in the statistical distributions of pixel values. $F(j,k)$ and $\hat{F}(j,k)$ denote
the samples of the original and degraded image fields, respectively.


Table 1. Image Quality Measures

Average Difference

$$AD = \frac{1}{MN} \sum_{j=1}^{M} \sum_{k=1}^{N} \left[ F(j,k) - \hat{F}(j,k) \right]$$

Structural Content

$$SC = \sum_{j=1}^{M} \sum_{k=1}^{N} [F(j,k)]^2 \Big/ \sum_{j=1}^{M} \sum_{k=1}^{N} [\hat{F}(j,k)]^2$$

Normalized Cross-Correlation

$$NK = \sum_{j=1}^{M} \sum_{k=1}^{N} F(j,k)\,\hat{F}(j,k) \Big/ \sum_{j=1}^{M} \sum_{k=1}^{N} [F(j,k)]^2$$

Correlation Quality

$$CQ = \sum_{j=1}^{M} \sum_{k=1}^{N} F(j,k)\,\hat{F}(j,k) \Big/ \sum_{j=1}^{M} \sum_{k=1}^{N} F(j,k)$$

Maximum Difference

$$MD = \max \{ |F(j,k) - \hat{F}(j,k)| \}$$

Image Fidelity

$$IF = 1 - \left( \sum_{j=1}^{M} \sum_{k=1}^{N} [F(j,k) - \hat{F}(j,k)]^2 \Big/ \sum_{j=1}^{M} \sum_{k=1}^{N} [F(j,k)]^2 \right)$$

Weighted Distance

WD: Every element of the difference matrix is normalized in some way and the $L_1$-norm is applied [1].

Laplacian Mean Square Error

$$LMSE = \sum_{j=1}^{M-1} \sum_{k=2}^{N-1} [O\{F(j,k)\} - O\{\hat{F}(j,k)\}]^2 \Big/ \sum_{j=1}^{M-1} \sum_{k=2}^{N-1} [O\{F(j,k)\}]^2$$

Peak Mean Square Error

$$PMSE = \frac{1}{MN} \sum_{j=1}^{M} \sum_{k=1}^{N} [F(j,k) - \hat{F}(j,k)]^2 \Big/ [\max\{F(j,k)\}]^2$$

Normalized Absolute Error

$$NAE = \sum_{j=1}^{M} \sum_{k=1}^{N} |O\{F(j,k)\} - O\{\hat{F}(j,k)\}| \Big/ \sum_{j=1}^{M} \sum_{k=1}^{N} |O\{F(j,k)\}|$$

Normalized Mean Square Error

$$NMSE = \sum_{j=1}^{M} \sum_{k=1}^{N} [O\{F(j,k)\} - O\{\hat{F}(j,k)\}]^2 \Big/ \sum_{j=1}^{M} \sum_{k=1}^{N} [O\{F(j,k)\}]^2$$

Lp-norm

$$L_p = \left\{ \frac{1}{MN} \sum_{j=1}^{M} \sum_{k=1}^{N} |F(j,k) - \hat{F}(j,k)|^p \right\}^{1/p}, \quad p = 1, 2, 3$$

Hosaka plot

A graphical quality measure. Its construction is described in Section 3.

Histogram

Another graphical quality measure. Gives the probability distribution
of the pixel values in the difference image.

Note: For LMSE, $O\{F(j,k)\} = F(j+1,k) + F(j-1,k) + F(j,k+1) + F(j,k-1) - 4F(j,k)$. For NAE, NMSE,
and $L_2$-norm, $O\{F(j,k)\}$ is defined in three ways: (1) $O\{F(j,k)\} = F(j,k)$, (2) $O\{F(j,k)\} = F(j,k)^{1/3}$,
(3) $O\{F(u,v)\} = H\{(u^2+v^2)^{1/2}\}\,F(u,v)$ (in cosine transform domain).
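As an illustration of how the numerical measures in Table 1 can be computed, the following is a minimal sketch in plain Python, assuming images are given as equal-sized lists of rows of gray levels. All function names are hypothetical and chosen for this example only. The `op` argument stands in for a pointwise choice of O{.} (identity or cube root); the transform-domain HVS variant is not covered, and restricting LMSE to interior pixels is an assumption about border handling.

```python
from math import fsum

def _pairs(original, degraded):
    """Flatten two equal-sized images into a list of (F, F_hat) pixel pairs."""
    return [(f, g) for row_o, row_d in zip(original, degraded)
            for f, g in zip(row_o, row_d)]

def average_difference(original, degraded):
    pairs = _pairs(original, degraded)
    return fsum(f - g for f, g in pairs) / len(pairs)

def structural_content(original, degraded):
    pairs = _pairs(original, degraded)
    return fsum(f * f for f, _ in pairs) / fsum(g * g for _, g in pairs)

def normalized_cross_correlation(original, degraded):
    pairs = _pairs(original, degraded)
    return fsum(f * g for f, g in pairs) / fsum(f * f for f, _ in pairs)

def maximum_difference(original, degraded):
    return max(abs(f - g) for f, g in _pairs(original, degraded))

def nae(original, degraded, op=lambda x: x):
    """Normalized absolute error; `op` is a pointwise O{.} (identity by default)."""
    pairs = _pairs(original, degraded)
    return (fsum(abs(op(f) - op(g)) for f, g in pairs)
            / fsum(abs(op(f)) for f, _ in pairs))

def nmse(original, degraded, op=lambda x: x):
    """Normalized mean square error; `op` is a pointwise O{.}."""
    pairs = _pairs(original, degraded)
    num = fsum((op(f) - op(g)) ** 2 for f, g in pairs)
    return num / fsum(op(f) ** 2 for f, _ in pairs)

def image_fidelity(original, degraded):
    return 1.0 - nmse(original, degraded)

def pmse(original, degraded):
    pairs = _pairs(original, degraded)
    mse = fsum((f - g) ** 2 for f, g in pairs) / len(pairs)
    return mse / max(f for f, _ in pairs) ** 2

def lp_norm(original, degraded, p):
    pairs = _pairs(original, degraded)
    return (fsum(abs(f - g) ** p for f, g in pairs) / len(pairs)) ** (1.0 / p)

def lmse(original, degraded):
    """Laplacian MSE over interior pixels only (border handling is an assumption)."""
    def lap(img, j, k):
        return (img[j + 1][k] + img[j - 1][k] + img[j][k + 1] + img[j][k - 1]
                - 4 * img[j][k])
    m, n = len(original), len(original[0])
    idx = [(j, k) for j in range(1, m - 1) for k in range(1, n - 1)]
    num = fsum((lap(original, j, k) - lap(degraded, j, k)) ** 2 for j, k in idx)
    return num / fsum(lap(original, j, k) ** 2 for j, k in idx)
```

The cube-root variant of NMSE would be obtained with `nmse(a, b, op=lambda x: x ** (1 / 3))` for non-negative gray levels.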
Among the few models of the HVS that have been developed, we chose the one proposed by Nill
for dealing with cosine transforms. The function for the model is defined as [2]

$$H(r) = \begin{cases} 0.05\, e^{r^{0.554}}, & \text{for } r < 7 \\ e^{-9\left[\,\left|\log_{10} r - \log_{10} 9\right|\,\right]^{2.3}}, & \text{for } r \ge 7, \end{cases}$$

where $r = (u^2+v^2)^{1/2}$, and u, v are the coordinates in the transform domain. The subimage structure
weighting factor $W_i$ in the original model was not used in our computations because we wanted to
investigate the effect of H(r) alone. Since $W_i$ is proportional to the intensity level variance of
subimage i, a separate analysis is needed to determine a suitable proportionality constant.
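The weighting function can be evaluated directly. The sketch below assumes the piecewise form H(r) = 0.05 exp(r^0.554) for r < 7 and H(r) = exp(-9 |log10 r - log10 9|^2.3) for r >= 7; the function names are hypothetical. Under this form the two branches nearly meet at r = 7 and the response peaks at r = 9.

```python
import math

def radial_frequency(u, v):
    """r = (u^2 + v^2)^(1/2) for transform-domain coordinates (u, v)."""
    return math.hypot(u, v)

def nill_hvs(r):
    """HVS weighting H(r) under the piecewise form assumed in the lead-in."""
    if r < 7:
        return 0.05 * math.exp(r ** 0.554)
    return math.exp(-9.0 * abs(math.log10(r) - math.log10(9.0)) ** 2.3)

def weight_coefficient(coeff, u, v):
    """Definition (3) of O{.}: scale one cosine-transform coefficient by H(r)."""
    return nill_hvs(radial_frequency(u, v)) * coeff
```

The weighting factor $W_i$ is deliberately left out, in line with the computations described above.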


Table 2. Image Compression Techniques

Code   Implementation
JPEG   Fourth public release of the Independent JPEG Group's JPEG software
EPIC   Vision Science Group, The Media Laboratory, MIT
RLPQ   Department of Computer Sciences, University of North Texas
SLPQ   Department of Computer Sciences, University of North Texas


The implementations of the image compression techniques are given in Table 2. Both JPEG and 
EPIC belong to the class of transform coding techniques. The former performs the discrete cosine 
transform and the latter a wavelet transform. RLPQ and SLPQ contain several modifications to the 
Laplacian pyramidal decomposition and use a loose wavelet basis. After quantization, they employ
arithmetic coding with a specifically tuned adaptive predictive model to compress the pyramid. 

It should be noted that the choice of the compression techniques for an investigation of the 
performance of quality measures (especially those that are graphical) is important since it is 
desirable to include techniques which produce different types of impairments in the reconstructed 
images. Our purpose is to see how well the measures are able to describe image distortions of 
dissimilar nature. As we shall discuss later, the four codes in Table 2 serve this purpose.

Information about the three test images that we used is given in Table 3. Lenna and
Fingerprint are in the set of the National Imagery Format Test Images. The third image, Hurricane
Gilbert, was obtained from the U.S. Navy.


Table 3. Test Images

Image         Source    Size (bytes x bytes)   Pixel Length (bits)   Spatial Frequency
Lenna         NITF      512x512                8                     14.07
Gilbert       US Navy   512x512                8                     31.25
Fingerprint   NITF      512x512                8                     59.37
The spatial frequency for a given image is defined as follows [3]:

Consider an MxN image, where M = number of rows and N = number of columns. The row and
column frequencies are given by

$$\text{Row\_Freq} = \sqrt{ \frac{1}{MN} \sum_{j=0}^{M-1} \sum_{k=1}^{N-1} [F(j,k) - F(j,k-1)]^2 }$$

and

$$\text{Column\_Freq} = \sqrt{ \frac{1}{MN} \sum_{k=0}^{N-1} \sum_{j=1}^{M-1} [F(j,k) - F(j-1,k)]^2 }$$

The total frequency is then

$$\text{Spatial frequency} = \sqrt{ (\text{Row\_Freq})^2 + (\text{Column\_Freq})^2 }.$$

This definition of frequency in the spatial domain indicates the overall activity level in an image. 
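The definition translates into a few lines of code. The following is a minimal sketch, assuming the image is a list of M rows of N gray levels and that both sums are normalized by MN; the function name is hypothetical.

```python
import math

def spatial_frequency(image):
    """Overall activity level: sqrt(Row_Freq^2 + Column_Freq^2)."""
    m, n = len(image), len(image[0])
    # Squared differences between horizontally adjacent pixels.
    row_sq = sum((image[j][k] - image[j][k - 1]) ** 2
                 for j in range(m) for k in range(1, n))
    # Squared differences between vertically adjacent pixels.
    col_sq = sum((image[j][k] - image[j - 1][k]) ** 2
                 for k in range(n) for j in range(1, m))
    row_freq = math.sqrt(row_sq / (m * n))
    col_freq = math.sqrt(col_sq / (m * n))
    return math.hypot(row_freq, col_freq)
```

A constant image gives a spatial frequency of zero, and the value grows as neighboring pixels differ more, which matches the ordering of the three test images from the smooth Lenna to the highly textured Fingerprint.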


3. Performance of Quality Measures

The gray scale image data set was obtained by coding and decoding the three test images with the 
compression codes listed in Table 2. For each test image, seven different compression ratios were 
selected for degradation. They range from 10:1 to 70:1 with an increment of about 10. (Our 
original intention was to use the ratios 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, and 70:1, but because of 
the inflexibility in using the JPEG parameter, we ended up with some different ratios.) 

The photographic samples of the degraded images were first subjectively evaluated in an office
environment by ten observers chosen from graduate students and faculty with
some background in image compression. They were asked to rank the images in two ways:
within each technique, and between the four techniques for a fixed compression ratio. The mean
rating of the group for an evaluation was computed by

$$R = \left( \sum_{k=1}^{10} s_k n_k \right) \Big/ \left( \sum_{k=1}^{10} n_k \right),$$

where $s_k$ = the score corresponding to the kth rating, $n_k$ = the number of observers with this
rating, and 10 = the number of grades in the scale. No limits were imposed on viewing time or
distance for the observers.
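The mean rating is a weighted average of the scores, with the number of observers choosing each score as the weight. A minimal sketch, assuming the votes are stored as a mapping from score to observer count (the function name and the sample votes are hypothetical):

```python
def mean_rating(observers_by_score):
    """Weighted mean of scores; weights are the observer counts per score."""
    total_observers = sum(observers_by_score.values())
    weighted_sum = sum(score * count
                       for score, count in observers_by_score.items())
    return weighted_sum / total_observers

# Hypothetical example: four observers gave a 7 and six gave an 8.
votes = {7: 4, 8: 6}
rating = mean_rating(votes)
```

Scores that no observer selected contribute nothing to either sum, so they can be omitted from the mapping.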

Table 4 shows the correlation between the numerical objective quality measures and the subjective 
evaluation. As a measure of the extent of the linear relationship, the Pearson product-moment 
correlation coefficient (r) was used. The possible values of r are between -1 and +1 ; the closer r is 
to -1 or +1, the better the correlation is. 
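For reference, the Pearson coefficient between a list of objective measure values and the corresponding mean subjective ratings can be computed as in the sketch below. The function name is hypothetical, and both inputs are assumed to have equal length and nonzero variance.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

An error measure that grows as perceived quality falls would give r close to -1, whereas a fidelity measure would give r close to +1; both indicate a strong linear relationship.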

The coefficient values in Part (a) of Table 4 indicate that the quality measures can be put into three 
groups according to their performance: 

Group I: AD, SC 

Group II: NK, CQ, LMSE, MD 

Group III: WD, PMSE, IF, NAE, NMSE, Lp.
Table 4. (a) Correlation coefficients for each technique (panel 1: Lenna; columns: Measure/Code, JPEG, EPIC, RLPQ)

Table 4. (b) Correlation coefficients across techniques (panel 1: Lenna)

The measures in Group I cannot be reliably used with all techniques, as the sign of the correlation
coefficient does not remain the same. Group II measures are consistent, but nevertheless have
poor correlation with the observers' response for some of the techniques. Among the useful
measures in Group III, NMSE(HVS) is the best one for all the test images. Except for a single
case, the incorporation of the HVS into NMSE makes the correlation slightly stronger. For the
other two measures, NAE and L2, however, there is no such improvement. (In fact, the visual
model has an adverse effect on NAE.) The results reported in [4] and [5] support our conclusion
that the HVS model does not always improve the correlation, and when it does, the gain is small.
The nonlinear (cube-root) filter, on the other hand, behaves unpredictably, but usually leads to
a weaker correlation. As IF is defined in terms of NMSE, the results for these two measures are
identical. It has been found that PMSE establishes the same relationship as well.
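The relationship among IF, NMSE, and PMSE can be made explicit. The short derivation below assumes the identity operator $O\{F\} = F$ in NMSE and the definitions of Table 1; the constant $c$ is notation introduced here for illustration only.

$$IF = 1 - \frac{\sum_{j}\sum_{k} [F(j,k) - \hat{F}(j,k)]^2}{\sum_{j}\sum_{k} [F(j,k)]^2} = 1 - NMSE$$

$$PMSE = \frac{\sum_{j}\sum_{k} [F(j,k) - \hat{F}(j,k)]^2}{MN\,[\max\{F(j,k)\}]^2} = c \cdot NMSE, \qquad c = \frac{\sum_{j}\sum_{k} [F(j,k)]^2}{MN\,[\max\{F(j,k)\}]^2} > 0$$

For a fixed original image, $c$ does not depend on the degraded image, so IF and PMSE are both affine functions of NMSE. The Pearson coefficient is unchanged by an affine map with positive slope and only changes sign when the slope is negative, so for a given test image the three measures yield correlations of equal magnitude.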

Part (b) of Table 4 is rather disappointing, and the information that can be extracted is limited. As
the compression ratio is increased, the measures perform much more poorly. This observation is not
surprising because different techniques introduce different types of degradation into the
reconstructed images. Since the metrics combine all the pixel differences between two given
images into a single number, one cannot expect to learn much about the annoyance experienced by
the human observer. In our experiments, for instance, although JPEG was the code for which the
errors were always the smallest, the observers found the tile effect very objectionable in Lenna, yet
favored blockiness in the higher frequency images Gilbert and Fingerprint.

To the best of our knowledge, histograms and Hosaka plots are the only two image quality
measures that are graphical. Before we evaluate their performance, a specification of the type of
impairment caused by the techniques is needed. Because of space limitations, the results for only
the first test image will be discussed here. Four degraded versions of Lenna for the highest
compression ratio (69:1) are given in Figure 1. The original image is also included for
comparison. The major types of degradation in the images are blockiness with JPEG, blurriness
with EPIC, both fuzziness and blockiness with RLPQ, and fuzziness with SLPQ. (The term
fuzziness is used in the sense of an equal amount of blurriness over the entire image.)

A histogram of the compression error is constructed by plotting the number of times a specific 
value occurs in the difference image versus the value itself. Typically, it looks like a Gaussian 
curve; the more it resembles a spike at x=0, the greater the fidelity of the reconstructed image. The 
seven histograms in Figure 2 were obtained using JPEG. They clearly depict the increase in the 
amount of blockiness as the compression ratio goes up. The concentration of low intensity pixels 
for the lowest ratio is gradually reduced and the distribution becomes more uniform. Our 
experience has shown that histograms may also be used to specify different types of degradation in 
images. In Figure 3, the histograms with low intensity pixel concentrations are associated with 
RLPQ and SLPQ, and they are in contrast with those corresponding to JPEG and EPIC. The 
uniform fuzziness over the entire image, it is understood, leads to a spiky histogram. 
Nevertheless, the similarity between the histograms in each pair makes it difficult to distinguish 
between the artifacts involved. 
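Constructing the error histogram amounts to counting how often each value occurs in the difference image. The following is a minimal sketch, assuming images are equal-sized lists of rows; the function names are hypothetical.

```python
from collections import Counter

def error_histogram(original, degraded):
    """Count occurrences of each value in the difference image F - F_hat."""
    return Counter(f - g
                   for row_o, row_d in zip(original, degraded)
                   for f, g in zip(row_o, row_d))

def error_distribution(original, degraded):
    """Normalize the counts so that they sum to 1 (a probability distribution)."""
    counts = error_histogram(original, degraded)
    total = sum(counts.values())
    return {value: counts[value] / total for value in sorted(counts)}
```

A reconstruction of high fidelity places almost all of the mass at zero, which is the spike at x=0 described above.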

To construct a Hosaka plot, or an h-plot, we measure a number of features of the reconstructed 
image and compare these with the corresponding features in the original image [6]. The difference 
between the two feature vectors generates a vector error measure, which, unlike scalar quantities, 
allows for a description of not only the amount, but also the type of degradation. In the process, 
the original image is first segmented into blocks whose variance is less than some specified 
threshold. These blocks are then grouped together to form a number of classes which depend on 
the size of the blocks. Two features are computed for each class in both the original and the 
reconstructed images. One of them is related to the mean intensity values and the other is the mean
standard deviation. The h-plot is constructed by plotting the errors in the corresponding features in
polar coordinates. The radius denotes the feature error; the left and right half planes contain the
vectors associated with standard deviations and means, respectively.
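The feature extraction behind an h-plot can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [6]: it assumes a quadtree split of the original image (blocks are halved until their variance falls below the threshold), takes the mean error of a class as the average absolute difference of block means, and takes the standard deviation error as the absolute difference of the class-average block standard deviations. All names are hypothetical, and the initial block size is assumed to be a power of two.

```python
from statistics import fmean, pstdev, pvariance

def block_pixels(img, r, c, size):
    """Pixels of the size x size block whose top-left corner is (r, c)."""
    return [img[i][j] for i in range(r, r + size) for j in range(c, c + size)]

def segment(img, r, c, size, threshold, out):
    """Quadtree split: accept a block once its variance is below the threshold."""
    if size == 1 or pvariance(block_pixels(img, r, c, size)) < threshold:
        out.append((r, c, size))
        return
    half = size // 2
    for dr in (0, half):
        for dc in (0, half):
            segment(img, r + dr, c + dc, half, threshold, out)

def hosaka_features(original, degraded, init_size=16, threshold=10.0):
    """Return {block size: (mean error, standard deviation error)} per class."""
    rows, cols = len(original), len(original[0])
    blocks = []
    # Only full init_size x init_size tiles are segmented (an assumption).
    for r in range(0, rows - init_size + 1, init_size):
        for c in range(0, cols - init_size + 1, init_size):
            segment(original, r, c, init_size, threshold, blocks)
    classes = {}
    for r, c, size in blocks:
        classes.setdefault(size, []).append((r, c))
    errors = {}
    for size, members in sorted(classes.items()):
        mean_err, sd_orig, sd_degr = [], [], []
        for r, c in members:
            o = block_pixels(original, r, c, size)
            d = block_pixels(degraded, r, c, size)
            mean_err.append(abs(fmean(o) - fmean(d)))
            sd_orig.append(pstdev(o))
            sd_degr.append(pstdev(d))
        errors[size] = (fmean(mean_err), abs(fmean(sd_orig) - fmean(sd_degr)))
    return errors
```

To draw the plot, each class would contribute one radial vector per feature: the standard deviation errors in the left half plane and the mean errors in the right half plane, with the radius given by the error value.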

It is reported in [6] that when noise is added to an image, the area of the h-plot is proportional to
the image quality, but the structure of the diagram depends on the type of distortion. If an image is
blurred, on the other hand, the pattern on the right side of the diagram remains fixed and increases
in magnitude as the blurring increases, while the left side is much less predictable.

The h-plots in Figure 4 were obtained using Lenna for all compression techniques and ratios. In
each diagram, the length of a radius is 2.75 units. The blockiness is reflected on the right side of
the h-plots, whereas the effect of blurriness can be traced on the left. By a simple comparison, we are
able to see the way each code reduces the fidelity of the image. One can even learn how the
distortion is distributed in the reconstructed images by looking at the relative lengths of the
components along the axes. For example, it is evident that JPEG preserves the high frequency
components (the feathers) of the image, whereas RLPQ induces uniform blockiness. Such
information is extremely helpful considering the sensitivity of the human observer to the location of
the image error. For the construction of the h-plots in Figure 4, the two parameters, the initial
block size N and the variance threshold T, were chosen as 16 and 10, respectively, as in Hosaka's
or Farrelle's work [6]. For high compression ratios, the h-plots for JPEG and RLPQ indicate that
it may be worth trying larger values for these parameters.

4. Conclusions 

The results of an evaluation concerning the usefulness of a number of objective quality measures 
for grayscale image compression have been presented. It is understood that although a group of 
numerical measures can reliably be used to specify the magnitude of degradation in reconstructed 
images for a given compression technique, an evaluation across different techniques is not 
possible. This is because a single scalar value cannot be used to describe a variety of impairments. 
A simple analogy would be the futility in comparing apples with oranges. The two graphical
measures, however, are fairly successful in specifying the type of degradation. Hosaka plots, in 
particular, provide a good indication of how images are degraded. A combination of numerical and 
graphical measures may prove more useful in judging image quality. There is also a need for the 
development of new graphical measures with superior judgment capabilities. Further research in 
these areas is now ongoing. 